Prediction of Supplemental Anesthesia

Summary

Machine learning techniques are applied to a pulp sensibility data set to generate value from it. Supervised learning algorithms are used to solve the classification problem of predicting whether a patient will need supplemental anesthesia, and to identify the features that drive that need.

Tools Used

Skills

  • Machine learning
    • Classification
  • Exploratory data analysis
  • Data cleaning
  • Data visualization

Data Collection

We collected data for 127 patients with 20 features, including the binary target feature (`Need Supliment`) indicating whether a patient needs a supplement or not.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 127 entries, 0 to 126
Data columns (total 20 columns):
 #   Column                                    Non-Null Count  Dtype 
---  ------                                    --------------  ----- 
 0   Patient                                   127 non-null    object
 1   Age                                       127 non-null    int64 
 2   Dental History                            127 non-null    int64 
 3   Medical History                           127 non-null    object
 4   Pain (VAS)                                127 non-null    int64 
 5   Pain ( Duration) days                     127 non-null    int64 
 6   Percussion                                127 non-null    int64 
 7   Palpation                                 127 non-null    int64 
 8   Mobility                                  127 non-null    int64 
 9   PDL involvement                           127 non-null    int64 
 10  Curved Canal                              127 non-null    int64 
 11  Pulp stone or and Calcification           127 non-null    int64 
 12  PDL space                                 127 non-null    int64 
 13  Lamina Dura                               127 non-null    object
 14  Cold test ( VAS) Before anaesthesia       127 non-null    int64 
 15  Cold test (Duration) Before anaesthesia   127 non-null    int64 
 16  EPT ( VAS) before anaesthesia             127 non-null    int64 
 17  EPT current pass                          127 non-null    int64 
 18  EPT (Duration) before anaesthesia         127 non-null    int64 
 19  Need Supliment                            127 non-null    object
dtypes: int64(16), object(4)
memory usage: 20.8+ KB

Data Preview

Patient Age Dental History Medical History Pain (VAS) Pain ( Duration) days Percussion Palpation Mobility PDL involvement Curved Canal Pulp stone or and Calcification PDL space Lamina Dura Cold test ( VAS) Before anaesthesia Cold test (Duration) Before anaesthesia EPT ( VAS) before anaesthesia EPT current pass EPT (Duration) before anaesthesia Need Supliment
0 F 37 0 0 6 30 2 0 0 1 1 0 1 LOSS 0 0 5 32 5 Yes
1 F 47 0 0 2 30 0 0 0 0 0 0 0 0 3 3 0 80 3 Yes
2 F 27 0 0 6 7 0 0 0 0 0 0 0 0 7 23 5 27 19 Yes
3 M 27 0 0 6 7 0 0 0 0 0 0 0 0 7 27 5 21 37 Yes
4 M 23 0 0 4 60 1 0 0 0 1 0 1 0 5 12 2 43 5 Yes

Exploratory Data Analysis

Data Profile Report

Data profile report to explore the contents of the collected data set.

EDA - Exploratory Data Analysis

Categorical Features of Dataset

Patient Dental History Medical History Percussion Palpation Mobility PDL involvement Curved Canal Pulp stone or and Calcification PDL space Lamina Dura Need Supliment
0 F 0 0 2 0 0 1 1 0 1 LOSS Yes
1 F 0 0 0 0 0 0 0 0 0 0 Yes
2 F 0 0 0 0 0 0 0 0 0 0 Yes
3 M 0 0 0 0 0 0 0 0 0 0 Yes
4 M 0 0 1 0 0 0 1 0 1 0 Yes

Numerical Features of Dataset

Age Pain (VAS) Pain ( Duration) days Cold test ( VAS) Before anaesthesia Cold test (Duration) Before anaesthesia EPT ( VAS) before anaesthesia EPT current pass EPT (Duration) before anaesthesia
0 37 6 30 0 0 5 32 5
1 47 2 30 3 3 0 80 3
2 27 6 7 7 23 5 27 19
3 27 6 7 7 27 5 21 37
4 23 4 60 5 12 2 43 5

Target Column (Supplement Requirement) Data Distribution

Observations:

  • Out of the 127 patients in our data set, 84 required a supplement and 43 did not.
  • Although the data set is imbalanced, we can still draw some insights from it.

Observations:

  • Male patients needed a supplement more often than female patients.

Observations:

  • Although most patients have no previous medical history, a significant share of the patients with a medical history such as diabetes mellitus (DM) needed a supplement.

Observations:

  • Even patients with no previous dental history often needed a supplement.

Observations:

  • A supplement is required when calcified tissue is present at the level of the pulp chamber and the roots of the teeth.

Observations:

  • The age distribution of the data set is bimodal: the experiment is mainly based on two groups, (1) ages 25-34 and (2) ages 55-59.

Observations:

  • The graph above shows that, on average, older patients are more likely to need a supplement.

Observations:

  • Longer pain duration is strongly associated with needing a supplement; when the average pain duration is under 10 days, a supplement is usually not needed.

Observations:

  • When the electric pulp testing (EPT) duration exceeds 6 minutes, a supplement is needed.

Classification Problem

Feature Engineering

Most machine learning algorithms can only work with numeric data so it was necessary to encode the categorical features into numeric features. As all of the categorical features in the data set are nominal, i.e., their classes have no meaningful order, I used one-hot encoding to convert the categorical features into indicator variables, also known as dummy variables. One-hot encoding creates a new dummy variable for each class in a categorical feature, where a value of 1 for a dummy variable indicates the presence of the class and a value of 0 indicates the absence of the class.
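The encoding step described above can be sketched with `pandas.get_dummies`. The miniature DataFrame below is hypothetical, not the real data; only the column names `Medical History` and `Need Supliment` come from the data set.

```python
import pandas as pd

# Hypothetical miniature of the data set: one nominal feature
# ("Medical History") and the binary target ("Need Supliment").
df = pd.DataFrame({
    "Medical History": ["0", "DM", "HTN", "DM"],
    "Need Supliment": ["Yes", "No", "Yes", "Yes"],
})

# Map the binary target to 0/1 and one-hot encode the nominal feature:
# get_dummies drops the original column and adds one indicator column
# per class, e.g. "Medical History_DM".
df["Need Supliment"] = df["Need Supliment"].map({"No": 0, "Yes": 1})
encoded = pd.get_dummies(df, columns=["Medical History"])

print(encoded.columns.tolist())
```

Applied to the full data set, this is what expands `Medical History` into the `Medical History_*` indicator columns shown in the preview below.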

Patient Age Dental History Medical History Pain (VAS) Pain ( Duration) days Percussion Palpation Mobility PDL involvement Curved Canal Pulp stone or and Calcification PDL space Lamina Dura Cold test ( VAS) Before anaesthesia Cold test (Duration) Before anaesthesia EPT ( VAS) before anaesthesia EPT current pass EPT (Duration) before anaesthesia Need Supliment
0 0 37 0 0 6 30 2 0 0 1 1 0 1 1 0 0 5 32 5 1
1 0 47 0 0 2 30 0 0 0 0 0 0 0 0 3 3 0 80 3 1
2 0 27 0 0 6 7 0 0 0 0 0 0 0 0 7 23 5 27 19 1
3 1 27 0 0 6 7 0 0 0 0 0 0 0 0 7 27 5 21 37 1
4 1 23 0 0 4 60 1 0 0 0 1 0 1 0 5 12 2 43 5 1
Patient Age Dental History Pain (VAS) Pain ( Duration) days Percussion Palpation Mobility PDL involvement Curved Canal ... Medical History_0 Medical History_CARD Medical History_DM Medical History_DM, HTN Medical History_DM, HTN, CAD Medical History_HT0 Medical History_HT0, DM Medical History_HTN Medical History_HTN, CARD Medical History_HTN, DM
0 0 37 0 6 30 2 0 0 1 1 ... 1 0 0 0 0 0 0 0 0 0
1 0 47 0 2 30 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
2 0 27 0 6 7 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
3 1 27 0 6 7 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4 1 23 0 4 60 1 0 0 0 1 ... 1 0 0 0 0 0 0 0 0 0

5 rows × 29 columns

Feature Selection

Feature selection filters the data set down to its most informative features, since not all features are equally important: some have no effect on the output and can be dropped. The goal is to reduce the data before feeding it to the training model.

  • We select the top 15 features by combining Pearson’s correlation, chi-squared, random forest, and LightGBM rankings.
  • Pearson’s correlation: we check the absolute value of the correlation between the target and each numerical feature and keep the top n features by this criterion.
  • Chi-squared: we calculate the chi-squared statistic between the target and each variable and keep the variables with the largest values.
  • Random forest: we select features based on feature importance; in a random forest, the final importance of a feature is the average of its importance across all decision trees.
  • We could also use LightGBM, or an XGBoost model, or any estimator with a `feature_importances_` attribute.
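A minimal sketch of the voting scheme the bullets describe, on synthetic stand-in data (the feature names, target, and grid sizes here are illustrative, not the real data set; LightGBM is left out so the sketch needs only scikit-learn):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
# Synthetic stand-in: 100 samples, 6 non-negative features (chi2 needs >= 0).
X = pd.DataFrame(rng.integers(0, 10, size=(100, 6)),
                 columns=[f"f{i}" for i in range(6)])
y = (X["f0"] + X["f1"] > 9).astype(int)  # target driven by f0 and f1
k = 3

# Pearson correlation: keep the k features most correlated with the target.
pearson = X.corrwith(y).abs().nlargest(k).index

# Chi-squared: keep the k features with the largest chi2 statistic.
chi = X.columns[SelectKBest(chi2, k=k).fit(X, y).get_support()]

# Random forest: keep the k features with the highest mean importance.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
forest = X.columns[np.argsort(rf.feature_importances_)[-k:]]

# Vote: a feature's total is the number of methods that selected it,
# analogous to the "Total" column in the comparison table below.
votes = pd.Series(0, index=X.columns)
for selected in (pearson, chi, forest):
    votes[selected] += 1
print(votes.sort_values(ascending=False))
```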

Finding the Top Features by Comparing Feature Selection Techniques

Feature Pearson Chi-2 Random Forest LightGBM Total
1 Pulp stone or and Calcification True True True True 4
2 Pain ( Duration) days True True True True 4
3 Age True True True True 4
4 Percussion True True False True 3
5 Palpation True True False True 3
6 EPT (Duration) before anaesthesia True True True False 3
7 Dental History True True False True 3
8 Curved Canal True True False True 3
9 Cold test (Duration) Before anaesthesia True False True True 3
10 Cold test ( VAS) Before anaesthesia True False True True 3
11 Pain (VAS) False False True True 2
12 Mobility False True False True 2
13 Medical History_DM, HTN True True False False 2
14 Medical History_DM True True False False 2
15 Medical History_0 True True False False 2
16 EPT current pass True False True False 2
17 EPT ( VAS) before anaesthesia True False True False 2
18 Patient False False False True 1
19 PDL space False False False True 1
20 PDL involvement False False False True 1
21 Medical History_HTN, CARD False True False False 1
22 Medical History_HT0, DM False True False False 1
23 Medical History_HT0 False True False False 1
24 Lamina Dura False False False True 1
25 Medical History_HTN, DM False False False False 0
26 Medical History_HTN False False False False 0
27 Medical History_DM, HTN, CAD False False False False 0
28 Medical History_CARD False False False False 0

Final Features

Total number of Selected Features for ML Model (including Target Column):  11

1# Need Supliment
2# Pulp stone or and Calcification
3# Pain ( Duration) days
4# Age
5# Percussion 
6# Palpation
7# EPT (Duration) before anaesthesia 
8# Dental History
9# Curved Canal 
10# Cold test (Duration) Before anaesthesia 
11# Cold test ( VAS) Before anaesthesia 


Preview of Data
Need Supliment Pulp stone or and Calcification Pain ( Duration) days Age Percussion Palpation EPT (Duration) before anaesthesia Dental History Curved Canal Cold test (Duration) Before anaesthesia Cold test ( VAS) Before anaesthesia
0 1 0 30 37 2 0 5 0 1 0 0
1 1 0 30 47 0 0 3 0 0 3 3
2 1 0 7 27 0 0 19 0 0 23 7
3 1 0 7 27 0 0 37 0 0 27 7
4 1 0 60 23 1 0 5 0 1 12 5

Model Training

Train/Test Split

We reserved 80% of the observations for the train set and 20% for the test set.
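A sketch of this split with scikit-learn, on placeholder arrays sized like the data set (127 rows, 84 positive / 43 negative); the `stratify` and `random_state` choices are assumptions, not stated in the report:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix and target standing in for the selected
# features and the "Need Supliment" column.
X = np.arange(127 * 10).reshape(127, 10)
y = np.array([1] * 84 + [0] * 43)

# stratify=y keeps the Yes/No ratio similar in both splits, which matters
# for an imbalanced target like this one.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(len(X_train), len(X_test))  # 101 26
```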

Normalization of Data

Input variables may have different units (e.g. feet, kilometers, and hours) that, in turn, may mean the variables have different scales.

Differences in the scales across input variables may increase the difficulty of the problem being modeled. An example of this is that large input values (e.g. a spread of hundreds or thousands of units) can result in a model that learns large weight values. A model with large weight values is often unstable, meaning that it may suffer from poor performance during learning and sensitivity to input values resulting in higher generalization error.

Standardization scales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation to shift the distribution to have a mean of zero and a standard deviation of one.
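The centering-and-scaling step just described maps onto scikit-learn's `StandardScaler`. The toy matrices below are illustrative; one detail worth noting is that the scaler is fit on the train set only, then applied to the test set, so no test-set statistics leak into training:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test matrices standing in for the selected features.
X_train = np.array([[30.0, 37], [7, 27], [60, 23]])
X_test = np.array([[10.0, 45]])

# Fit on the train set (learn each column's mean and std), then apply the
# same centering and scaling to both splits.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0).round(6))  # ~[0. 0.]
print(X_train_std.std(axis=0).round(6))   # ~[1. 1.]
```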

 Preview of Normalized Data
Pulp stone or and Calcification Pain ( Duration) days Age Percussion Palpation EPT (Duration) before anaesthesia Dental History Curved Canal Cold test (Duration) Before anaesthesia Cold test ( VAS) Before anaesthesia
0 1.538397 -0.479279 0.853624 1.371916 1.614665 -0.103195 1.951800 -0.619324 0.449482 0.602527
1 -0.650027 -0.035176 -1.308012 -1.210012 -0.619324 -0.943732 -0.512348 -0.619324 -0.997012 0.217367
2 1.538397 1.123352 1.934441 1.371916 -0.619324 0.569235 -0.512348 1.614665 1.228362 0.217367
3 -0.650027 1.123352 -0.809173 0.511273 -0.619324 -1.111840 -0.512348 1.614665 0.672019 -0.938112
4 -0.650027 -0.575823 -1.058592 -1.210012 -0.619324 -1.111840 -0.512348 -0.619324 -0.997012 -0.552952

Cross-Validation

We performed grid-search cross-validation to cross-validate the models and tune their hyperparameters.
Grid-search cross-validation for the logistic regression model is performed below.

GridSearch cross-validation for the KNN model is performed below.
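A sketch of grid-search cross-validation for both models; the synthetic data, hyperparameter grids, `cv=5`, and `scoring="f1"` are assumptions for illustration, since the actual grids are not shown in this report:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data (101 train rows).
X, y = make_classification(n_samples=101, n_features=10, random_state=0)

# Illustrative hyperparameter grids, one per model.
searches = {
    "logreg": GridSearchCV(LogisticRegression(max_iter=1000),
                           {"C": [0.01, 0.1, 1, 10]},
                           scoring="f1", cv=5),
    "knn": GridSearchCV(KNeighborsClassifier(),
                        {"n_neighbors": [3, 5, 7, 9]},
                        scoring="f1", cv=5),
}
for name, search in searches.items():
    search.fit(X, y)  # exhaustively tries every grid point with 5-fold CV
    print(name, search.best_params_, round(search.best_score_, 3))
```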

Model Evaluation

Performance on Train and Test Sets

Having trained and cross-validated the models, I then used the models to make predictions on the test set. I evaluated the performance of the models on the test set using the same F1 and accuracy metrics used to evaluate the models during cross-validation. The performance of the models as indicated by these metrics is displayed below.

Logistic regression F1 (train): 0.793
Logistic regression F1 (test): 0.878 

Logistic regression accuracy (train): 0.723
Logistic regression accuracy (test): 0.808 

Logistic regression Classification Report: 
               precision    recall  f1-score   support

           0       0.50      0.60      0.55         5
           1       0.90      0.86      0.88        21

    accuracy                           0.81        26
   macro avg       0.70      0.73      0.71        26
weighted avg       0.82      0.81      0.81        26

Logistic regression Confusion Matrix: 
 [[ 3  2]
 [ 3 18]]


KNN F1 (train): 0.781
KNN F1 (test): 0.829 

KNN accuracy (train): 0.732
KNN accuracy (test): 0.731 
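Metrics like the ones above can be computed with scikit-learn. The labels and predictions below are hypothetical, not the actual test-set results:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score)

# Hypothetical test-set labels and model predictions.
y_test = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1])

print("F1:", round(f1_score(y_test, y_pred), 3))
print("Accuracy:", round(accuracy_score(y_test, y_pred), 3))
print(classification_report(y_test, y_pred))  # per-class precision/recall/F1
print(confusion_matrix(y_test, y_pred))       # rows: true class, cols: predicted
```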

AUC and ROC

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve, and AUC represents the degree of separability: it tells how well the model can distinguish between the classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1.

Observations:

  • An AUC of 0.70 means there is a 70% chance that the model will rank a randomly chosen positive case above a randomly chosen negative one.
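Computing AUC and the ROC curve points is a one-liner with scikit-learn; the labels and predicted probabilities below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical test-set labels and predicted probabilities for the
# positive ("Need Supliment" = Yes) class.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_prob = np.array([0.2, 0.6, 0.8, 0.3, 0.9, 0.1, 0.7, 0.4])

# AUC is the fraction of (positive, negative) pairs where the positive
# case receives the higher predicted probability.
auc = roc_auc_score(y_true, y_prob)
fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # points of the ROC curve
print(round(auc, 3))
```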

Evaluating Bias vs Variance

To objectively determine the degree of bias and variance exhibited by the models, I used the guidelines presented below.

Bias:

  • High bias: F1 < 0.70
  • Medium bias: 0.70 <= F1 < 0.90
  • Low bias: 0.90 <= F1

Variance:

  • High variance: (% difference in F1 between train and test set) > 25%
  • Medium variance: 5% < (% difference in F1 between train and test set) <= 25%
  • Low variance: (% difference in F1 between train and test set) <= 5%
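The guidelines above can be expressed as two small helper functions. One assumption is made explicit here: the "% difference in F1" is taken relative to the train-set F1, which the report does not spell out:

```python
def bias_level(f1_train: float) -> str:
    """Classify bias from the train-set F1 score per the guidelines above."""
    if f1_train < 0.70:
        return "high"
    if f1_train < 0.90:
        return "medium"
    return "low"

def variance_level(f1_train: float, f1_test: float) -> str:
    """Classify variance from the % difference in F1 between train and test.

    Assumption: the percentage is relative to the train-set F1.
    """
    pct_diff = abs(f1_train - f1_test) / f1_train * 100
    if pct_diff > 25:
        return "high"
    if pct_diff > 5:
        return "medium"
    return "low"

# Logistic regression F1 scores reported above: train 0.793, test 0.878.
print(bias_level(0.793), variance_level(0.793, 0.878))  # medium medium
```

Under this reading, the logistic regression model shows medium bias (train F1 of 0.793) and medium variance (a train-to-test F1 gap of about 10.7%).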